Q1

library(data.table)
pm25_2004<-fread('ad_viz_plotval_data_2004.csv')
pm25_2019<-fread('ad_viz_plotval_data_2019.csv')
dim(pm25_2004);dim(pm25_2019)
## [1] 19233    20
## [1] 53156    20
colnames(pm25_2004);colnames(pm25_2019)
##  [1] "Date"                           "Source"                        
##  [3] "Site ID"                        "POC"                           
##  [5] "Daily Mean PM2.5 Concentration" "UNITS"                         
##  [7] "DAILY_AQI_VALUE"                "Site Name"                     
##  [9] "DAILY_OBS_COUNT"                "PERCENT_COMPLETE"              
## [11] "AQS_PARAMETER_CODE"             "AQS_PARAMETER_DESC"            
## [13] "CBSA_CODE"                      "CBSA_NAME"                     
## [15] "STATE_CODE"                     "STATE"                         
## [17] "COUNTY_CODE"                    "COUNTY"                        
## [19] "SITE_LATITUDE"                  "SITE_LONGITUDE"
##  [1] "Date"                           "Source"                        
##  [3] "Site ID"                        "POC"                           
##  [5] "Daily Mean PM2.5 Concentration" "UNITS"                         
##  [7] "DAILY_AQI_VALUE"                "Site Name"                     
##  [9] "DAILY_OBS_COUNT"                "PERCENT_COMPLETE"              
## [11] "AQS_PARAMETER_CODE"             "AQS_PARAMETER_DESC"            
## [13] "CBSA_CODE"                      "CBSA_NAME"                     
## [15] "STATE_CODE"                     "STATE"                         
## [17] "COUNTY_CODE"                    "COUNTY"                        
## [19] "SITE_LATITUDE"                  "SITE_LONGITUDE"
sapply(pm25_2004,class);sapply(pm25_2019,class)
##                           Date                         Source 
##                    "character"                    "character" 
##                        Site ID                            POC 
##                      "integer"                      "integer" 
## Daily Mean PM2.5 Concentration                          UNITS 
##                      "numeric"                    "character" 
##                DAILY_AQI_VALUE                      Site Name 
##                      "integer"                    "character" 
##                DAILY_OBS_COUNT               PERCENT_COMPLETE 
##                      "integer"                      "numeric" 
##             AQS_PARAMETER_CODE             AQS_PARAMETER_DESC 
##                      "integer"                    "character" 
##                      CBSA_CODE                      CBSA_NAME 
##                      "integer"                    "character" 
##                     STATE_CODE                          STATE 
##                      "integer"                    "character" 
##                    COUNTY_CODE                         COUNTY 
##                      "integer"                    "character" 
##                  SITE_LATITUDE                 SITE_LONGITUDE 
##                      "numeric"                      "numeric"
##                           Date                         Source 
##                    "character"                    "character" 
##                        Site ID                            POC 
##                      "integer"                      "integer" 
## Daily Mean PM2.5 Concentration                          UNITS 
##                      "numeric"                    "character" 
##                DAILY_AQI_VALUE                      Site Name 
##                      "integer"                    "character" 
##                DAILY_OBS_COUNT               PERCENT_COMPLETE 
##                      "integer"                      "numeric" 
##             AQS_PARAMETER_CODE             AQS_PARAMETER_DESC 
##                      "integer"                    "character" 
##                      CBSA_CODE                      CBSA_NAME 
##                      "integer"                    "character" 
##                     STATE_CODE                          STATE 
##                      "integer"                    "character" 
##                    COUNTY_CODE                         COUNTY 
##                      "integer"                    "character" 
##                  SITE_LATITUDE                 SITE_LONGITUDE 
##                      "numeric"                      "numeric"
summary(pm25_2004$`Daily Mean PM2.5 Concentration`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -0.10    6.00   10.10   13.13   16.30  251.00
summary(pm25_2019$`Daily Mean PM2.5 Concentration`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -2.200   4.000   6.500   7.739   9.900 120.900

As a result, the 2004 dataset includes 19233 rows and 20 columns and the 2019 dataset includes 53156 rows and 20 columns. The colnames names and type of the columns are both shown above. Besides, the interested variable, daily mean PM2.5 concentration are summaried as mean, median, 1th and 3rd quantile, minimum and maximum. No NAs are existed in the Daily Mean PM2.5 Concentration variable. However, some negative values are existed, which are abnomal due to concentration of PM2.5 cannot be negative.

Q2

pm25<-rbind(pm25_2004,pm25_2019)
pm25$Year<-rep(c(2004,2019),c(nrow(pm25_2004),nrow(pm25_2019)))
head(pm25)
##          Date Source  Site ID POC Daily Mean PM2.5 Concentration    UNITS
## 1: 01/01/2004    AQS 60010007   1                           11.0 ug/m3 LC
## 2: 01/02/2004    AQS 60010007   1                           12.2 ug/m3 LC
## 3: 01/03/2004    AQS 60010007   1                           16.5 ug/m3 LC
## 4: 01/04/2004    AQS 60010007   1                           19.5 ug/m3 LC
## 5: 01/05/2004    AQS 60010007   1                           11.5 ug/m3 LC
## 6: 01/06/2004    AQS 60010007   1                           32.5 ug/m3 LC
##    DAILY_AQI_VALUE Site Name DAILY_OBS_COUNT PERCENT_COMPLETE
## 1:              46 Livermore               1              100
## 2:              51 Livermore               1              100
## 3:              60 Livermore               1              100
## 4:              67 Livermore               1              100
## 5:              48 Livermore               1              100
## 6:              94 Livermore               1              100
##    AQS_PARAMETER_CODE                     AQS_PARAMETER_DESC CBSA_CODE
## 1:              88502 Acceptable PM2.5 AQI & Speciation Mass     41860
## 2:              88502 Acceptable PM2.5 AQI & Speciation Mass     41860
## 3:              88502 Acceptable PM2.5 AQI & Speciation Mass     41860
## 4:              88502 Acceptable PM2.5 AQI & Speciation Mass     41860
## 5:              88502 Acceptable PM2.5 AQI & Speciation Mass     41860
## 6:              88502 Acceptable PM2.5 AQI & Speciation Mass     41860
##                            CBSA_NAME STATE_CODE      STATE COUNTY_CODE  COUNTY
## 1: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
## 2: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
## 3: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
## 4: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
## 5: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
## 6: San Francisco-Oakland-Hayward, CA          6 California           1 Alameda
##    SITE_LATITUDE SITE_LONGITUDE Year
## 1:      37.68753      -121.7842 2004
## 2:      37.68753      -121.7842 2004
## 3:      37.68753      -121.7842 2004
## 4:      37.68753      -121.7842 2004
## 5:      37.68753      -121.7842 2004
## 6:      37.68753      -121.7842 2004

Q3

library(leaflet)
leaflet(data=pm25)%>%
  addTiles()%>%
  addCircles(~SITE_LONGITUDE, ~SITE_LATITUDE,
             color = ~ifelse(Year==2004,'blue','red'))

In this figure, the blue points represent the locations of the sites for 2004 and the red points represent the locations of the sites for 2019. According to the map, it reveals that the number of sites increase obviously from 2004 to 2019. Besides, the sites for 2019 are almost located at coastal.

Q4

As shown in Q1, it is found that no NAs are existed in the variable Daily Mean PM2.5 Concentration, but some negative values are existed in the variable. Thus the proportion of the negative values are shown as:

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::between()   masks data.table::between()
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::first()     masks data.table::first()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ dplyr::last()      masks data.table::last()
## ✖ purrr::transpose() masks data.table::transpose()
pm25%>%mutate(neg_value=`Daily Mean PM2.5 Concentration`<0)%>%
  group_by(neg_value)%>%summarise(prop=n()/nrow(pm25))
## # A tibble: 2 × 2
##   neg_value    prop
##   <lgl>       <dbl>
## 1 FALSE     0.996  
## 2 TRUE      0.00391

As a result, it indicates that only 0.391% values of Daily Mean PM2.5 Concentration are negative, which may caused by some mistakes. To investigate the temporal pattern, the date of negative values are shown as:

pm25%>%filter(`Daily Mean PM2.5 Concentration`<0)%>%
  mutate(month=substr(Date,1,2))%>%
  ggplot(aes(month))+geom_bar()

As a result, it is revealed that the negative values are often appeares in March and April, but negative in November is lowest.

Q5

pm25%>%ggplot(aes(`Daily Mean PM2.5 Concentration`))+geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Firstly, the distribution of the PM2.5 concentration is investigated. As a result, the distribution is right skewed, which indicates that log-transformation is suitable here.

1. State level

pm25%>%filter(`Daily Mean PM2.5 Concentration`>0)%>%
  group_by(Year)%>%
  summarise(log_pm25_mean=mean(log(`Daily Mean PM2.5 Concentration`)),
            log_pm25_sd=sd(log(`Daily Mean PM2.5 Concentration`)))
## # A tibble: 2 × 3
##    Year log_pm25_mean log_pm25_sd
##   <dbl>         <dbl>       <dbl>
## 1  2004          2.27       0.840
## 2  2019          1.81       0.750
pm25%>%filter(`Daily Mean PM2.5 Concentration`>0)%>%
  mutate(Year=as.factor(Year))%>%
  ggplot(aes(x=Year,y=`Daily Mean PM2.5 Concentration`))+
  geom_boxplot()+scale_y_log10()

As the boxplot log-transformed shown, the level of PM2.5 is decrease from 2004 to 2019, which is also validated by the averages and standard deviances of log(PM2.5) calculated as summary statistic.

County level

pm25%>%filter(`Daily Mean PM2.5 Concentration`>0)%>%
  group_by(Year,COUNTY)%>%
  summarise(log_pm25_mean=mean(log(`Daily Mean PM2.5 Concentration`)))%>%
  pivot_wider(names_from = 'Year',
              values_from=log_pm25_mean)%>%
  mutate(diff=`2019`-`2004`)
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
## # A tibble: 51 × 4
##    COUNTY       `2004` `2019`   diff
##    <chr>         <dbl>  <dbl>  <dbl>
##  1 Alameda       2.22   1.83  -0.385
##  2 Butte         2.01   1.65  -0.366
##  3 Calaveras     1.77   1.58  -0.191
##  4 Colusa        1.97   1.69  -0.279
##  5 Contra Costa  2.27   1.82  -0.452
##  6 Del Norte     0.954  1.35   0.393
##  7 El Dorado     0.875  0.723 -0.152
##  8 Fresno        2.36   1.86  -0.500
##  9 Humboldt      1.87   1.74  -0.131
## 10 Imperial      2.24   2.11  -0.128
## # … with 41 more rows
pm25%>%filter(`Daily Mean PM2.5 Concentration`>0)%>%
  mutate(Year=as.factor(Year))%>%
  group_by(Year,COUNTY)%>%
  summarise(log_pm25_mean=mean(log(`Daily Mean PM2.5 Concentration`)))%>%
  ggplot(aes(y=COUNTY,x=log_pm25_mean,colour=Year))+
  geom_point()
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.

As shown in the scatter plot, the points representing 2004 are almost located right of points representing 2019, which indicates that the PM2.5 for 2019 is lower than 2004 for most of counties. Besides, difference of log_PM2.5 between 2004 and 2019 are also calculated as summary statistic, which are almost lower than 0, which also indicates that for almost counties, PM2.5 in 2019 is lower than 2004.

Site in Los Angeles level

pm25%>%filter(`Daily Mean PM2.5 Concentration`>0)%>%
  group_by(Year,`Site Name`)%>%
  summarise(log_pm25_mean=mean(log(`Daily Mean PM2.5 Concentration`)))%>%
  pivot_wider(names_from = 'Year',
              values_from=log_pm25_mean)%>%
  na.omit()%>%
  filter(`Site Name`!='')
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
## # A tibble: 77 × 3
##    `Site Name`                 `2004` `2019`
##    <chr>                        <dbl>  <dbl>
##  1 Anaheim                      2.71   2.08 
##  2 Aqua Tibia Wilderness        1.68   1.18 
##  3 Azusa                        2.74   2.06 
##  4 Bakersfield-Airport (Planz)  2.62   2.35 
##  5 Bakersfield-California       2.81   2.25 
##  6 Bakersfield-Golden / M St    2.83   2.29 
##  7 Big Bear                     2.08   1.38 
##  8 Bliss SP                     0.787  0.431
##  9 Brawley-220 Main Street      2.07   2.01 
## 10 Calexico-Ethel Street        2.43   2.15 
## # … with 67 more rows

Similarly in counties level, the average log_PM2.5 is calculated as summary statistic. To make the comparison reasonable, only sites with observations in both 2004 and 2019 are retained. Due to the number of sites are too many, thus heat map is used rather than scatter plot:

pm25%>%filter(`Daily Mean PM2.5 Concentration`>0)%>%
    group_by(Year,`Site Name`)%>%
    summarise(log_pm25_mean=mean(log(`Daily Mean PM2.5 Concentration`)))%>%
    pivot_wider(names_from = 'Year',
                values_from=log_pm25_mean)%>%
    na.omit()%>%
    filter(`Site Name`!='')%>%
    pivot_longer(cols=c(`2004`,`2019`),values_to='log_PM2.5')%>%
    ggplot(aes(x=name,y=`Site Name`,fill=`log_PM2.5`))+
  geom_tile()+
  scale_fill_gradient(low='white',high='black')
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.

As a result, the tiles representing 2019 is light than 2004, which indicates that PM2.5 in 2004 is higher than 2019.

Summary, daily concentrations of PM25 has decreased from 2004 to 2019.